Tech Arsenal 1

home *** CD-ROM | disk | FTP | other *** search

/ Tech Arsenal 1 / Tech Arsenal (Arsenal Computer).ISO / tek-15 / phbench.zip / ARTICLE < prev next >

Wrap

Text File | 1993-01-04 | 15KB | 399 lines

[The following article appeared in "C Users Journal" May 1988. It describes the purpose and use of the enclosed benchmarks. ] SIMPLE BENCHMARKS FOR C COMPILERS by Thomas Plum Dr.Plum is the author of several books on C, including Efficient C (co- authored with Jim Brodie). He is Vice-Chair of the ANSI X3J11 Committee, and Chairman of Plum Hall Inc, which offers introductory and advanced sem- inars on C. Copyright (c) 1988, Plum Hall Inc We are placing into the public domain some simple benchmarks with several appealing properties: They are short enough to type while browsing at trade shows. They are protected against overly-aggressive compiler optimizations. They reflect empirically-observed operator frequencies in C programs. They give a C programmer information directly relevant to programming. In Efficient C, Jim Brodie and I described how useful it can be for a pro- grammer to have a general idea of how many microseconds it takes to execute the "average operator" on register int's, on auto short's, on auto long's, and on double data, as well as the time for an integer multiply, and the time to call-and-return from a function. These six numbers allow a programmer to make very good first-order estimates of the CPU time that a particular algorithm will take. The following easily-typed benchmark programs determine these times directly. The first one is benchreg.c ("benchmark for register opera- tors"): - 1 - - 2 - 1 /* benchreg - benchmark for register integers 2 * Thomas Plum, Plum Hall Inc, 609-927-3770 3 * If machine traps overflow, use an unsigned type 4 * Let T be the execution time in milliseconds 5 * Then average time per operator = T/major usec 6 * (Because the inner loop has exactly 1000 operations) 7 */ 8 #define STOR_CL register 9 #define TYPE int 10 #include <stdio.h> 11 main(ac, av) 12 int ac; 13 char *av[]; 14 { 15 STOR_CL TYPE a, b, c; 16 long d, major, atol(); 17 static TYPE m[10] = {0}; 18 19 major = atol(av[1]); 20 printf("executing %ld iterations0, major); 21 a = b = (av[1][0] - '0'); 22 for (d = 1; d <= major; ++d) 23 { 24 /* inner loop executes 1000 selected operations */ 25 for (c = 1; c <= 40; ++c) 26 { 27 a = a + b + c; 28 b = a >> 1; 29 a = b % 10; 30 m[a] = a; 31 b = m[a] - b - c; 32 a = b == c; 33 b = a | c; 34 a = !b; 35 b = a + c; 36 a = b > c; 37 } 38 } 39 printf("a=%d0, a); 40 } If you enter this and compile it to produce an executable program, you can invoke it with one argument, the number of iterations for the major loop: benchreg 10000 If this execution takes 16 seconds, this means that the average register operation takes 1.6 microseconds (16,000 milliseconds divided by 10,000 iterations of the major loop). Let us examine the program in detail. Lines 8 and 9 define STOR_CL ("storage class") and TYPE to be register and int . Thus, on line 15, three variables ( a , b , and c ) are declared to be of this storage class and type. At line 16, the major loop control variables are long integers, but they are touched only one one-thousandth as often as the inner loop - 3 - variables, so they have little effect upon the timings. We are declaring the atol function to return a long integer; it would otherwise default to an int return. (If we were using a compiler based upon draft ANSI C, we could #include <stdlib.h> to get the declaration of atol , but this would limit the applicability of the benchmarks. This simple declaration is all that even an ANSI compiler would need.) At line 19, we set the major loop variable to the number given on the com- mand line, and at line 20, we confirm it to the output. Line 21 is crucial to preventing some overly aggressive optimizations. Ear- lier versions of these benchmarks had simply initialized a and b to 1, but this allows a compiler to forward-propagate a known constant value. The expression av[1][0] gives the first digit-character of the command-line argument; subtracting '0' produces a digit between 0 and 9. (Yes, the latest ANSI draft now guarantees that the digit characters are a contiguous sequence in any environment.) Line 22 simply executes the major loop the number of times given by the variable major . Line 25 repeats the inner loop 40 times, and with 25 operators in that loop, this produces 1000 operators. (Actually there are 1003, because of the initialization and the extra increment and test at loop completion. The discrepancy is well within acceptable tolerances.) Within the inner loop, 40% of the operators are assignments, in keeping with the percentages reported in the original Drhystone work. Of the other operators, the most frequent are plus and minus. The sequence of operations is carefully chosen to ensure that a very aggressive optimizer cannot find any useless code sections; each result depends functionally upon previous results. Finally, the printout at line 39 is also important to preventing over- optimization. If the compiler could notice that we did nothing with the computed result, it could discard all the operations that produced that result. We have completed our perusal of the first benchmark program, benchreg.c . The second program ( benchsho.c , for short's) is derived from benchreg.c by changing lines 8 and 9: STOR_CL becomes auto , and TYPE becomes short . The program is otherwise unchanged. The third program ( benchlng.c , for long's) is obtained by leaving STOR_CL as auto and changing TYPE to long . To make the fourth program ( benchmul.c , for multiplies) we set TYPE to int , and change lines 27 through 36 to one source line which does 25 multi- plies: a = 3 *a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a*a; /* 25 * */ The fifth program ( benchfn.c , for functions) is a major rewrite. We arrange a series of function definitions for f3 , f2 , f1 , and f0 such that each call to function f0 generates exactly 1000 function-call opera- tions. In case the compiler has an aggressive optimizer, move the function f3 to a separate source file, so that the compiler cannot see how useless - 4 - it is. The global variable dummy will make the compiler think that f3 might be up to something useful. Here, then, is the benchfn.c function: 1 /* benchfn - benchmark for function calls 2 * Thomas Plum, Plum Hall Inc, 609-927-3770 3 * Let T be the execution time in milliseconds 4 * Then average time per operator = T/major usec 5 * (Because the inner loop has exactly 1000 operations) 6 */ 7 #include <stdio.h> 8 int dummy = 0; 9 10 /* f3 - lowest level function 11 * Put this in separate source file if compiler detects and 12 * optimizes useless code 13 */ 14 f3() { } 15 16 f2() { f3();f3();f3();f3();f3();f3();f3();f3();f3();f3();} /* 10 */ 17 f1() { f2();f2();f2();f2();f2();f2();f2();f2();f2();f2();} /* 10 */ 18 f0() { f1();f1();f1();f1();f1();f1();f1();f1();f1();} /* 9 */ 19 20 main(ac, av) 21 int ac; 22 char *av[]; 23 { 24 long d, major, atol(); 25 26 major = atol(av[1]); 27 printf("executing %ld iterations0, major); 28 for (d = 1; d <= major; ++d) 29 f0(); /* executes 1000 calls */ 30 printf("dummy=%d0, dummy); 31 } The sixth program ( benchdblc. , for double's ) is derived from benchlng.c by changing STOR_CL to auto , TYPE to double , and replacing the inner loop body with this slightly different version: a = a + b + c; b = a * 2; a = b / 10; a = -a; b = -a - b - c; a = b == c; b = a + c; a = !b; b = a + c; a = b > c; These changes are necessary because floating-point operands are not allowed for the shift, remainder, and bitwise operators, and because the subscript operator does not really exercise the floating-point instructions. This revised inner loop still gives us a representative mix of typical opera- tions. - 5 - This, then, completes our collection of six benchmark programs. After they are compiled to produce executable programs, the next question is "How do I time the execution?" On UNIX systems, the timing is easy -- just run the time command: $ time benchreg 10000 The sum of the "user" and "system" times will give the CPU time used by the program. More accurately, we could time the execution of zero iterations, and sub- tract that time from the time for the measured number of iterations. On MS-DOS systems, timings can be obtained, but with greater difficulty. If we create a file named CR-LF which contains just one newline (or "carriage-return-newline" in DOS parlance), we could time our program with a "batch" file such as this: time <cr-lf benchreg 0 time <cr-lf benchreg 10000 time <cr-lf We must then take times that are expressed in minutes-and-seconds and pro- duce differences expressed in seconds. With whichever method, we eventually produce six numbers that are character- istic of a particular environment (a specific compiler supporting a specific machine). [NOTE: Since this article appeared, I have added a driver program, benches.c. In an ANSI environment with the clock function, it will run all the tests and report the results, eliminating the need for manual computations.] Here are some examples of timing results that have been obtained on a variety of minicomputer and workstation environments: - 6 - Machine/compiler register auto auto int func auto int short long multiply call dbl AT&T 3B2/05 (-O) 1.36 3.87 2.62 15.4 7.7 22.5 AT&T 3B2/05 (no -O) 1.78 4.66 2.75 16.2 9.3 22.5 AT&T 3B2/400 (-O) 1.09 1.36 1.10 16.2 10.0(?) 91.4 AT&T 3B2/400 (no -O) 1.14 2.61 2.36 17.3 11.3 91.1 Apollo DN330 (-O) 1.36 .78 1.36 10.17 3.57 Apollo DN330 (no -O) 1.54 1.28 1.54 11.30 3.64 Apollo DN580 (-O) 1.03 .59 1.03 7.67 2.72 Apollo DN580 (no -O) 1.18 .97 1.18 8.48 2.77 Apollo DN660 (-O) 5.88 1.24 5.88 21.86 4.26 Apollo DN660 (no -O) 5.93 1.52 5.93 21.93 4.29 Cray X-MP (no vectors) .0567 .0656 .0822 .366 .821 .082 Masscomp 5500 3.18 2.7 4.9 30.8 7.3 Masscomp 5600 (-O) .45 .61 .46 2.83 1.04 Masscomp 5600 (no -O) .46 .78 .64 2.99 1.76 Pyramid 90X (-O) .85 1.04 .86 3.64 1.9 2.37 Pyramid 90X (no -O) .86 1.01 .86 3.65 1.8 2.34 Sequent (-O) 1.39 2.99 2.53 9.90 9.3 Sequent (no -O) 1.50 3.25 2.83 9.95 13.2 Sun 3/260HM (-O) .31 .48 .47 1.98 1.16 Sun 3/260HM (no -O) .36 .58 .57 1.99 1.62 Sun 3/75M (-O) .47 .77 .76 3.00 2.12 Sun 3/75M (no -O) .53 .95 .94 3.01 2.73 Sun 3/75M(4.2, -O) .50 .81 .83 2.85 1.5 20.7 Sun 3/75M(4.2, no -O) .54 1.00 1.01 2.97 2.7 21.1 Sun 3/75M(VM, -O) .46 .77 .75 2.96 2.1 20.8 Sun 3/75M(VM, no -O) .52 .96 .93 2.97 2.7 21.1 VAX 11/730 (-O) 4.00 9.80 6.20 16.2 42.8 12.4 VAX 11/730 (no -O) 4.73 10.2 7.45 16.57 51.5 17.0 VAX 11/780 (-O) 1.21 2.43 1.67 2.76 15.0 2.95 VAX 11/780 (BSD 4.2) 1.38 2.42 1.96 2.92 17.2 VAX 11/780 (UNIX 5.2) 1.24 2.48 1.79 2.72 15.7 3.89 VAX 11/780 (no -O) 1.29 2.51 1.85 2.70 16.7 3.89 VAX 11/785 (-O) .93 1.85 1.32 5.00 13.9 47.5 VAX 11/785 (no -O) 1.01 1.96 1.44 5.08 14.2 5.42 VAX 8650(UNIX -O) .236 .484 .298 .589 2.63 .578 VAX 8650(UNIX no -O) .258 .482 .316 .574 3.06 .791 VAX 8650(Ultrix -O) .23 .40 .29 .53 2.4 .56 VAX 8650(Ultrix no -O) .26 .41 .34 .56 2.8 .77 Notice that some of these timings were run before the benchdbl benchmark had been written. There are no examples of the popular PC environments in this table. If interested readers wish to run these benchmarks on their own environments, I will endeavor to present these results in a future article. Processor speeds are sometimes described in "MIPS" (millions of instructions per second); using a value such as the number of register operators per second in C might give rise to a "MOPS" measurement of more use to C pro- grammers. Those of us who have tried these benchmarks have appreciated the intuitive grasp that they give of the speed of current machines and com- pilers. I hope that you too will find them of interest.